UCSG Shallow Parsing: Optimum Chunk Sequence Selection

Authors

  • B Hanumantha Rao
  • Kavi Narayana Murthy
Abstract

This paper addresses the syntactic analysis of natural language sentences, with a focus on wide-coverage partial parsing architectures. We enhance and enrich the UCSG shallow parsing architecture, which has been under development here for many years. The UCSG architecture combines linguistic grammars, in the form of Finite State Machines that recognise all potential chunks, with HMMs that rate and rank the chunks so produced. Here we explore Mutual Information statistics for rating and ranking chunks, as well as complete parses (chunk sequences), so as to place the best parses near the top. The main aim of this work is to identify the best word group (chunk) sequence, or global parse, for a given sentence using an information-theoretic measure called mutual information. The method is based on the hypothesis that the best chunk can be obtained by analysing the mutual information values of chunk tag n-grams. In the initial version of UCSG, HMMs were local to chunks. Global information, such as the probability of a chunk of a given type starting a sentence, or the probability of a chunk of a particular type occurring next to a chunk of a given type, is also useful. We capture this global information in the form of a mutual information score and use it to improve the ranks of correct chunks. Another important aspect of this work is combining the two methods, HMMs and Mutual Information, to obtain fuller information about each chunk and improve the ranks further. A Best First Search module then uses these ranked chunks to find the final parse. We have also added a manually parsed corpus of 1000 sentences to the existing manually parsed data of 4000 sentences. Experiments on the existing and newly added text corpora are included.
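
As a rough illustration of the idea in the abstract, the sketch below estimates pointwise mutual information for adjacent chunk tags (tag bigrams, including a sentence-start marker) and uses the summed scores to rank candidate chunk sequences. It is a minimal sketch under stated assumptions, not the authors' formulation: the tag names `NG`, `VG`, `PG`, the `<s>` marker, the function names, and the fixed penalty for unseen tag pairs are all hypothetical choices made for the example.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker (an assumption of this sketch)


def train_pmi(tag_sequences):
    """Estimate PMI for adjacent chunk-tag pairs from chunk-tag sequences."""
    left, right, pairs = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        padded = [START] + list(tags)
        for a, b in zip(padded, padded[1:]):
            left[a] += 1        # a seen as the first element of a pair
            right[b] += 1       # b seen as the second element of a pair
            pairs[(a, b)] += 1
    total = sum(pairs.values())
    pmi = {}
    for (a, b), n in pairs.items():
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        p_ab = n / total
        p_a = left[a] / total
        p_b = right[b] / total
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def score_sequence(tags, pmi, unseen=-10.0):
    """Sum PMI over adjacent tag pairs; unseen pairs get a fixed penalty."""
    padded = [START] + list(tags)
    return sum(pmi.get(pair, unseen) for pair in zip(padded, padded[1:]))


def rank_candidates(candidates, pmi):
    """Order candidate chunk-tag sequences from best to worst score."""
    return sorted(candidates, key=lambda tags: score_sequence(tags, pmi),
                  reverse=True)


if __name__ == "__main__":
    # Toy training data with hypothetical chunk tags.
    corpus = [["NG", "VG", "NG"], ["NG", "VG", "PG"], ["NG", "VG"]]
    pmi = train_pmi(corpus)
    for cand in rank_candidates([["VG", "NG"], ["NG", "VG"]], pmi):
        print(cand, round(score_sequence(cand, pmi), 3))
```

A sentence-start marker lets the same bigram table cover both kinds of global information the abstract mentions: which chunk type tends to open a sentence, and which chunk types tend to follow one another. A real system would need smoothing rather than a flat penalty for unseen pairs.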

Similar resources

UCSG: A Wide Coverage Shallow Parsing System

In this paper, we propose the UCSG Shallow Parsing Architecture for building wide-coverage shallow parsers through a judicious combination of linguistic and statistical techniques, without needing a large parsed training corpus to start with. Only a large POS-tagged corpus is required. A parsed corpus can be developed using the architecture with minimal manual effort, ...

Chunk Parsing and Entity Relation Extracting to Chinese Text by Using Conditional Random Fields Model

Large amounts of information now exist on Web sites and in various digital media, most of it in natural language. Such text is easy for people to browse but difficult for computers to understand. Chunk parsing and entity relation extraction are important steps towards understanding the semantics of this information in natural language processing. Chunk analysis is a shallow parsing method, and entity relation ext...

An Algorithm Combining Statistics-based and Rules-based for Chunk Identification of Chinese Sentences

Natural language processing (NLP) is a very active research domain. One important branch of it is sentence analysis, including Chinese sentence analysis. However, no mature theories or techniques for deep analysis are currently available. An alternative, which is very popular in the domain, is to perform shallow parsing on sentences. Chunk identification is a fundamental task for shallow parsi...

Robust German Noun Chunking With a Probabilistic Context-Free Grammar

We present a noun chunker for German which is based on a head-lexicalised probabilistic context-free grammar. A manually developed grammar was semi-automatically extended with robustness rules in order to allow parsing of unrestricted text. The model parameters were learned from unlabelled training data by a probabilistic context-free parser. For extracting noun chunks, the parser generates all ...

Exploiting Chunk-level Features to Improve Phrase Chunking

Most existing systems solve the phrase chunking task with sequence labeling approaches, in which chunk candidates cannot be treated as a whole during parsing, so chunk-level features cannot be exploited in a natural way. In this paper, we formulate phrase chunking as a joint segmentation and labeling task. We propose an efficient dynamic programming algorithm with pruni...

Publication date: 2007